Niels Richard Hansen
16/02-2016
The lasso acronym means
least absolute shrinkage and selection operator
and was introduced by Robert Tibshirani in
Regression Shrinkage and Selection via the Lasso
published in the first issue of Journal of the Royal Statistical Society: Series B, 1996.
You say las-SOO
and I say LAs-so.
From Spanish lazo.
We consider linear regression models of a response \( Y \) on a vector \( X = (X_1, \ldots, X_p) \) of predictors: \[ E(Y) = \mathbf{X}^T \beta= \sum_{j=1}^p X_j \beta_j. \]
With \( n \) observations we organize \( Y_1, \ldots, Y_n \) in a vector \( \mathbf{Y} \) and \( X_1, \ldots, X_n \) in a \( n \times p \) matrix \( \mathbf{X} \).
The least squares estimator is \[ \hat{\beta} = \mathop{\arg \min}\limits_{\beta} \|\mathbf{Y} - \mathbf{X}\beta\|_2^2 = \mathop{\arg \min}\limits_{\beta} \sum_{i=1}^n (Y_i - X_i^T \beta)^2, \] where \( \|\mathbf{z}\|_2^2 = \sum_{i=1}^n z_i^2 \) denotes the squared Euclidean norm.
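As a small illustration (on simulated data, not the prostate data; the sizes and seed are made up), the least squares estimator can be computed directly from the normal equations and checked against lm:

```r
## Sketch: least squares via the normal equations on simulated data.
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, 0, -1)
Y <- drop(X %*% beta) + rnorm(n)

## Normal equations: beta_hat solves (X^T X) beta = X^T Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

## Agrees with lm() without intercept
all.equal(c(beta_hat), unname(coef(lm(Y ~ X - 1))))
```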
The response, \( Y \), will be lpsa. The remaining 8 variables will be the predictors in \( X \).
lpsa lcavol lweight age lbph svi lcp gleason pgg45
1 -0.4308 -0.5798 2.769 50 -1.386 0 -1.386 6 0
2 -0.1625 -0.9943 3.320 58 -1.386 0 -1.386 6 0
3 -0.1625 -0.5108 2.691 74 -1.386 0 -1.386 7 20
4 -0.1625 -1.2040 3.283 58 -1.386 0 -1.386 6 0
All variables are standardized upfront.
prostate <- scale(prostate, TRUE, TRUE)
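For reference, scale with both arguments TRUE centers each column to mean zero and scales it to unit sample standard deviation; a small made-up matrix shows the effect:

```r
## What scale(x, TRUE, TRUE) does: column-wise centering and
## scaling to unit (sample) standard deviation.
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
xs <- scale(x, center = TRUE, scale = TRUE)
colMeans(xs)       ## 0 0
apply(xs, 2, sd)   ## 1 1
```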
## Intercept removed due to standardization
prostateLm <- lm(lpsa ~ . - 1,
data = as.data.frame(prostate))
summary(prostateLm)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lcavol 0.5762 0.0892 6.46 5.4e-09 ***
lweight 0.2309 0.0741 3.11 0.0025 **
age -0.1370 0.0711 -1.93 0.0571 .
lbph 0.1216 0.0724 1.68 0.0966 .
svi 0.2732 0.0860 3.18 0.0021 **
lcp -0.1285 0.1082 -1.19 0.2385
gleason 0.0308 0.0966 0.32 0.7507
pgg45 0.1089 0.1061 1.03 0.3072
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.603 on 89 degrees of freedom
Multiple R-squared: 0.663, Adjusted R-squared: 0.633
F-statistic: 21.9 on 8 and 89 DF, p-value: <2e-16
## Column 1 (lpsa) is the response; 'exclude = 1' removes it from
## the set of penalized predictors.
prostateLasso <- glmnet(x = prostate, y = prostate[, "lpsa"],
                        exclude = 1, intercept = FALSE)
plot(prostateLasso, label = TRUE, lwd = 2)
Lasso is a regression technique that gives a family of coefficient estimates \( ({}^{s \!}\hat{\beta})_{s \geq 0} \) indexed by a tuning parameter \( s \geq 0 \).
\[ {}^{s \!}\hat{\beta} = \mathop{\arg \min}\limits_{\beta: \|\beta\|_1 \leq s} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2, \]
where \[ \|\beta\|_1 = \sum_{i=1}^p |\beta_i| \]
denotes the \( \ell_1 \)-norm.
Lasso defines a family of estimates.
\[ \beta^{\lambda} = \mathop{\arg \min}\limits_{\beta} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2 + \lambda \|\beta\|_1. \]
The monotonically decreasing, data-dependent map
\[ s(\lambda) = \|\beta^{\lambda}\|_1 \]
from \( (0,\infty) \) to \( (0, s_{\mathrm{max}}) \) gives the relation
\[ \beta^{\lambda} = {}^{s(\lambda) \!}\hat{\beta} \]
between the constrained lasso and the penalized lasso.
If \( \mathbf{X} \) has orthonormal columns, i.e. \( \mathbf{X}^T \mathbf{X} = \mathbf{I}_p \), then \[ \beta_i^{\lambda} = \mathrm{sign}(\hat{\beta}_i) \max\{|\hat{\beta}_i| - \lambda/2, 0\} \] is soft thresholding of the least squares estimator.
This can be compared to hard thresholding \[ \hat{\beta}_i 1(|\hat{\beta}_i| > \lambda) \]
and linear shrinkage \[ \frac{\hat{\beta}_i}{1 + \lambda} \] related to ridge regression.
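The three rules can be compared in a few lines of base R (the function names, threshold and example values are illustrative):

```r
## Three shrinkage rules applied coordinate-wise to a least
## squares estimate (made-up values; threshold 0.5 is arbitrary).
soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
hard <- function(b, lambda) b * (abs(b) > lambda)
linear <- function(b, lambda) b / (1 + lambda)

beta_hat <- c(-1.2, 0.3, 0.8)
soft(beta_hat, 0.5)    ## -0.7  0.0  0.3
hard(beta_hat, 0.5)    ## -1.2  0.0  0.8
linear(beta_hat, 0.5)  ## shrinks all coordinates by the factor 1/1.5
```

Note that soft and hard thresholding set small coefficients exactly to zero, while linear shrinkage never does.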
Lasso is, for fixed \( s \), the solution of a quadratic optimization problem with \( 2^p \) linear inequality constraints.
Tibshirani proposes an iterative algorithm for adding active constraints. It requires the sequential solution of quadratic optimization problems with linear inequality constraints.
Tibshirani implemented public domain functions for S-PLUS.
To obtain them, use file transfer protocol to lib.stat.cmu.edu and retrieve the file S/lasso, or send an electronic mail message to statlib@lib.stat.cmu.edu with the message send lasso from S.
Homotopy methods as in On the LASSO and its Dual, Osborne, Presnell and Turlach, 2000, JCGS, compute the entire solution path \[ \lambda \mapsto \beta^{\lambda}. \]
Least angle regression (LAR) and its lasso modification (LARS) were developed by Efron et al. (2004); LARS is a fast homotopy algorithm.
prostateLars <- lars(x = prostate[, -1], y = prostate[, "lpsa"],
intercept = FALSE, normalize = FALSE)
This is a pure R implementation!
plot(prostateLars)
coefficients(prostateLars)
lcavol lweight age lbph svi lcp gleason pgg45
[1,] 0.000 0.000 0.0000 0.0000 0.000 0.000 0.0000 0.00000
[2,] 0.365 0.000 0.0000 0.0000 0.000 0.000 0.0000 0.00000
[3,] 0.400 0.000 0.0000 0.0000 0.035 0.000 0.0000 0.00000
[4,] 0.483 0.149 0.0000 0.0000 0.158 0.000 0.0000 0.00000
[5,] 0.487 0.161 0.0000 0.0000 0.166 0.000 0.0000 0.00917
[6,] 0.505 0.182 0.0000 0.0443 0.199 0.000 0.0000 0.03388
[7,] 0.517 0.202 -0.0519 0.0763 0.211 0.000 0.0000 0.05595
[8,] 0.522 0.213 -0.0812 0.0937 0.219 0.000 0.0107 0.06064
[9,] 0.576 0.231 -0.1370 0.1216 0.273 -0.128 0.0308 0.10891
The coordinate descent algorithm solves the lasso optimization problem very efficiently.
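A minimal base R sketch of coordinate descent for the penalized objective above; this illustrates the idea, it is not the glmnet implementation, and the simulated data and sweep count are made up:

```r
## Coordinate descent for min_beta ||Y - X beta||_2^2 + lambda ||beta||_1.
## Each coordinate update is a univariate lasso problem, solved in
## closed form by soft thresholding the least squares solution on
## the partial residual.
lasso_cd <- function(X, Y, lambda, sweeps = 200) {
  p <- ncol(X)
  beta <- numeric(p)
  for (s in seq_len(sweeps)) {
    for (j in seq_len(p)) {
      r <- Y - drop(X[, -j, drop = FALSE] %*% beta[-j]) ## partial residual
      z <- sum(X[, j] * r) / sum(X[, j]^2)              ## univariate LS solution
      thr <- lambda / (2 * sum(X[, j]^2))               ## threshold from the penalty
      beta[j] <- sign(z) * max(abs(z) - thr, 0)         ## soft thresholding
    }
  }
  beta
}

## With lambda = 0 the sweeps converge to the least squares solution
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
Y <- drop(X %*% c(2, 0, -1)) + rnorm(100)
max(abs(lasso_cd(X, Y, 0) - unname(coef(lm(Y ~ X - 1)))))  ## ~ 0
```

For large \( \lambda \) every coefficient is thresholded to exactly zero, which is why coordinate descent handles the non-differentiable \( \ell_1 \)-penalty so naturally.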
1) If you don't get the joke, look up the fast Fourier transform.
R. Tibshirani. Regression shrinkage and selection via the lasso: a retrospective, JRSSB 2011.
The least squares loss is replaced by the negative log-likelihood:
\[ \beta^{\lambda} = \mathop{\arg \min}\limits_{\beta} \ell(\beta) + \lambda \|\beta\|_1 \]
With \( K \) groups and \( p \) predictors there will be \( Kp \) parameters.
In the example we have \( K = 9 \) and \( p = 384 \) giving 3456 parameters.
load("miRNA.RData")
miRNAlasso <- glmnet(Xprim, Yprim, family = "multinomial")
par(mfcol = c(3, 3)); plot(miRNAlasso)
miRNAlasso <- cv.glmnet(Xprim, Yprim, family = "multinomial",
type.measure = "class")
plot(miRNAlasso, ylim = c(0, 0.4))
Ordinary lasso penalty
Group lasso penalty
Sparse group lasso penalty
lambda <- msgl.lambda.seq(Xprim, Yprim, alpha = 0, lambda.min = 0.01)
miRNAmsgl <- msgl.cv(Xprim, Yprim, lambda = lambda, alpha = 0)
miRNAlasso <- cv.glmnet(Xprim, Yprim, family = "multinomial",
type.measure = "class",
type.multinomial = "grouped")
plot(miRNAlasso, ylim = c(0, 0.4))
A professional cowboy recognizes the layman by his usage of the word lasso.
To the cowboy it is simply the rope.
My next cool statistical method has to be called Rope.
There is an R in Rope, and, hopefully, the professional R users will recognize the layman by his usage of lasso - once we have Rope!
Thanks!
library("glmnet") ## Lasso (and elastic net) via coordinate descent
library("lars") ## Lasso via the lar algorithm
library("msgl") ## Multinomial sparse group lasso
library("reshape2") ## For 'melt'
library("ggplot2") ## Plotting
library("lattice") ## Plotting
library("rgl") ## 3d
library("misc3d") ## More 3d